Cynical Selection of Language Model Training Data

Author

  • Amittai Axelrod
Abstract

The Moore-Lewis method of "intelligent selection of language model training data" is very effective, cheap, efficient... and also has structural problems. (1) The method defines relevance by playing language models trained on the in-domain and the out-of-domain (or data pool) corpora against each other. This powerful idea, which we set out to preserve, treats the two corpora as the opposing ends of a single spectrum. This lack of nuance does not allow for the two corpora to be very similar. In the extreme case where they come from the same distribution, all of the sentences have a Moore-Lewis score of zero, so there is no resulting ranking. (2) The selected sentences are not guaranteed to be able to model the in-domain data, nor even to cover the in-domain data. They are simply well-liked by the in-domain model; this is necessary, but not sufficient. (3) There is no way to tell what the optimal number of sentences to select is, short of picking various thresholds and building the systems. We present a greedy, lazy, approximate, and generally efficient information-theoretic method of accomplishing the same goal using only vocabulary counts. The method has the following properties: (1) It is responsive to the extent to which two corpora differ. (2) It quickly reaches near-optimal vocabulary coverage. (3) It takes into account what has already been selected. (4) It does not involve defining any kind of domain, nor any kind of classifier. (5) It knows approximately when to stop. This method can be used as an inherently meaningful measure of similarity, as it measures the bits of information to be gained by adding one text to another.
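The first structural problem can be illustrated with a minimal sketch of Moore-Lewis-style scoring. This is not the paper's code and not the cynical selection method itself: it assumes add-one-smoothed unigram models as a stand-in for the n-gram language models normally used, and the function names (`train_unigram`, `cross_entropy`, `moore_lewis_scores`) are hypothetical. Each pool sentence is scored by its per-token cross-entropy under the in-domain model minus its cross-entropy under the pool model, so lower scores look more in-domain.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Count whitespace tokens; returns (counts, total_tokens)."""
    counts = Counter(tok for s in sentences for tok in s.split())
    return counts, sum(counts.values())


def cross_entropy(sentence, counts, total, vocab_size):
    """Per-token cross-entropy in bits under an add-one-smoothed unigram model."""
    tokens = sentence.split()
    if not tokens:
        return 0.0
    log_prob = sum(
        math.log2((counts[t] + 1) / (total + vocab_size)) for t in tokens
    )
    return -log_prob / len(tokens)


def moore_lewis_scores(in_domain, pool):
    """Score each pool sentence as H_in(s) - H_pool(s); lower = more in-domain."""
    # Shared vocabulary so both models smooth over the same event space.
    vocab_size = len({tok for s in in_domain + pool for tok in s.split()})
    in_counts, in_total = train_unigram(in_domain)
    pool_counts, pool_total = train_unigram(pool)
    return [
        cross_entropy(s, in_counts, in_total, vocab_size)
        - cross_entropy(s, pool_counts, pool_total, vocab_size)
        for s in pool
    ]


if __name__ == "__main__":
    in_domain = ["the cat sat"]
    pool = ["the cat sat", "stock prices fell"]
    # The in-domain-like sentence receives the lower score.
    print(moore_lewis_scores(in_domain, pool))
    # Identical corpora: both models agree, so every score is zero
    # and the pool cannot be ranked.
    print(moore_lewis_scores(pool, pool))
```

When the two corpora are the same, both models assign identical probabilities to every sentence, the difference vanishes, and no ranking results, which is the degenerate case the abstract describes.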


Similar resources

The Effectiveness of Teaching Enneagram Concepts on Girls' Spouse Selection Criteria and Attitudes Toward Marriage

Background and Objectives: The foundation of any society is built upon marriage and starting a family. However, marriage is a complex and challenging issue, success in which is hinged upon various factors such as attitude towards marriage. This study was carried out to evaluate the effect of training the 9 Enneagram personality types on spouse selection criteria and attitude towards marriage in...


Training Language Teachers: An educational semiotic model

The changing culture toward multimodality enforces acquiring visual literacy in every aspect of today’s modern life. One of the fields intermingled with using various modes in different variations is language teaching and learning, especially for and by young learners. Young language learners’ (5-12 years old) lack of world experience forces them to make the most use of non-verbal mode...


Intelligent Selection of Language Model Training Data

We address the problem of selecting non-domain-specific language model training data to build auxiliary language models for use in tasks such as machine translation. Our approach is based on comparing the cross-entropy, according to domain-specific and non-domain-specific language models, for each sentence of the text source used to produce the latter language model. We show that this produces bet...


An Intelligence-Based Model for Supplier Selection Integrating Data Envelopment Analysis and Support Vector Machine

The importance of supplier selection is nowadays highlighted more than ever as companies have realized that efficient supplier selection can significantly improve the performance of their supply chain. In this paper, an integrated model that applies Data Envelopment Analysis (DEA) and Support Vector Machine (SVM) is developed to select efficient suppliers based on their predicted efficiency sco...


Method of Selecting Training Sets to Build Compact and Efficient Statistical Language Model

For statistical language model training, corpora matched to the target task are required. However, training corpora sometimes include both sentences that match the target task and sentences that do not. In such a case, training set selection is effective for both model size reduction and model performance improvement. In this paper, a training set selection method for statistical language model training is described. The ...



Journal:
  • CoRR

Volume: abs/1709.02279

Publication date: 2017